17 research outputs found

    Learning space-time structures for action recognition and localization

    Get PDF
    In this thesis the problem of automatic human action recognition and localization in videos is studied. In this problem, our goal is to recognize the category of the human action that is happening in the video, and also to localize the action in space and/or time. This problem is challenging due to the complexity of the human actions, the large intra-class variations and the distraction of backgrounds. Human actions are inherently structured patterns of body movements. However, past works are inadequate in learning the space-time structures in human actions and exploring them for better recognition and localization. In this thesis new methods are proposed that exploit such space-time structures for effective human action recognition and localization in videos, including sports videos, YouTube videos, TV programs and movies. A new local space-time video representation, the hierarchical Space-Time Segments, is first proposed. Using this new video representation, ensembles of hierarchical spatio-temporal trees, discovered directly from the training videos, are constructed to model the hierarchical, spatial and temporal structures of human actions. This proposed approach achieves promising performances in action recognition and localization on challenging benchmark datasets. Moreover, the discovered trees show good cross-dataset generalizability: trees learned on one dataset can be used to recognize and localize similar actions in another dataset. To handle large scale data, a deep model is explored that learns temporal progression of the actions using Long Short Term Memory (LSTM), which is a type of Recurrent Neural Network (RNN). Two novel ranking losses are proposed to train the model to better capture the temporal structures of actions for accurate action recognition and temporal localization. This model achieves state-of-art performance on a large scale video dataset. A deep model usually employs a Convolutional Neural Network (CNN) to learn visual features from video frames. The problem of utilizing web action images for training a Convolutional Neural Network (CNN) is also studied: training CNN typically requires a large number of training videos, but the findings of this study show that web action images can be utilized as additional training data to significantly reduce the burden of video training data collection

    Learning activity progression in LSTMs for activity detection and early detection

    Full text link
    In this work we improve training of temporal deep models to better learn activity progression for activity detection and early detection tasks. Conventionally, when training a Recurrent Neural Network, specifically a Long Short Term Memory (LSTM) model, the training loss only considers classification error. However, we argue that the detection score of the correct activity category, or the detection score margin between the correct and incorrect categories, should be monotonically non-decreasing as the model observes more of the activity. We design novel ranking losses that directly penalize the model on violation of such monotonicities, which are used together with classification loss in training of LSTM models. Evaluation on ActivityNet shows significant benefits of the proposed ranking losses in both activity detection and early detection tasks.https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Ma_Learning_Activity_Progression_CVPR_2016_paper.htmlPublished versio

    Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web

    Full text link
    Recently, attempts have been made to collect millions of videos to train Convolutional Neural Network (CNN) models for action recognition in videos. However, curating such large-scale video datasets requires immense human labor, and training CNNs on millions of videos demands huge computational resources. In contrast, collecting action images from the Web is much easier and training on images requires much less computation. In addition, labeled web images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. Through extensive experiments, we explore the question of whether we can utilize web action images to train better CNN models for action recognition in videos. We collect 23.8K manually filtered images from the Web that depict the 101 actions in the UCF101 action video dataset. We show that by utilizing web action images along with videos in training, significant performance boosts of CNN models can be achieved. We also investigate the scalability of the process by leveraging crawled web images (unfiltered) for UCF101 and ActivityNet. Using unfiltered images we can achieve performance improvements that are on-par with using filtered images. This means we can further reduce annotation labor and easily scale-up to larger problems. We also shed light on an artifact of finetuning CNN models that reduces the effective parameters of the CNN and show that using web action images can significantly alleviate this problem.https://arxiv.org/pdf/1512.07155v1.pdfFirst author draf

    Opening the Vocabulary of Egocentric Actions

    Full text link
    Human actions in egocentric videos are often hand-object interactions composed from a verb (performed by the hand) applied to an object. Despite their extensive scaling up, egocentric datasets still face two limitations - sparsity of action compositions and a closed set of interacting objects. This paper proposes a novel open vocabulary action recognition task. Given a set of verbs and objects observed during training, the goal is to generalize the verbs to an open vocabulary of actions with seen and novel objects. To this end, we decouple the verb and object predictions via an object-agnostic verb encoder and a prompt-based object encoder. The prompting leverages CLIP representations to predict an open vocabulary of interacting objects. We create open vocabulary benchmarks on the EPIC-KITCHENS-100 and Assembly101 datasets; whereas closed-action methods fail to generalize, our proposed method is effective. In addition, our object encoder significantly outperforms existing open-vocabulary visual recognition methods in recognizing novel interacting objects.Comment: 20 pages, 7 figures; https://dibschat.github.io/openvocab-egoAR

    Salient object subitizing

    Full text link
    We study the problem of salient object subitizing, i.e. predicting the existence and the number of salient objects in an image using holistic cues. This task is inspired by the ability of people to quickly and accurately identify the number of items within the subitizing range (1–4). To this end, we present a salient object subitizing image dataset of about 14 K everyday images which are annotated using an online crowdsourcing marketplace. We show that using an end-to-end trained convolutional neural network (CNN) model, we achieve prediction accuracy comparable to human performance in identifying images with zero or one salient object. For images with multiple salient objects, our model also provides significantly better than chance performance without requiring any localization process. Moreover, we propose a method to improve the training of the CNN subitizing model by leveraging synthetic images. In experiments, we demonstrate the accuracy and generalizability of our CNN subitizing model and its applications in salient object detection and image retrieval.This research was supported in part by US NSF Grants 0910908 and 1029430, and gifts from Adobe and NVIDIA. (0910908 - US NSF; 1029430 - US NSF)https://arxiv.org/abs/1607.07525https://arxiv.org/pdf/1607.07525.pdfAccepted manuscrip

    Similarity-Based Processing of Motion Capture Data

    Get PDF
    Motion capture technologies digitize human movements by tracking 3D positions of specific skeleton joints in time. Such spatio-temporal data have an enormous application potential in many fields, ranging from computer animation, through security and sports to medicine, but their computerized processing is a difficult problem. The recorded data can be imprecise, voluminous, and the same movement action can be performed by various subjects in a number of alternatives that can vary in speed, timing or a position in space. This requires employing completely different data-processing paradigms compared to the traditional domains such as attributes, text or images. The objective of this tutorial is to explain fundamental principles and technologies designed for similarity comparison, searching, subsequence matching, classification and action detection in the motion capture data. Specifically, we emphasize the importance of similarity needed to express the degree of accordance between pairs of motion sequences and also discuss the machine-learning approaches able to automatically acquire content-descriptive movement features. We explain how the concept of similarity together with the learned features can be employed for searching similar occurrences of interested actions within a long motion sequence. Assuming a user-provided categorization of example motions, we discuss techniques able to recognize types of specific movement actions and detect such kinds of actions within continuous motion sequences. Selected operations will be demonstrated by on-line web applications

    Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning

    Full text link
    Traffic accident anticipation aims to predict accidents from dashcam videos as early as possible, which is critical to safety-guaranteed self-driving systems. With cluttered traffic scenes and limited visual cues, it is of great challenge to predict how long there will be an accident from early observed frames. Most existing approaches are developed to learn features of accident-relevant agents for accident anticipation, while ignoring the features of their spatial and temporal relations. Besides, current deterministic deep neural networks could be overconfident in false predictions, leading to high risk of traffic accidents caused by self-driving systems. In this paper, we propose an uncertainty-based accident anticipation model with spatio-temporal relational learning. It sequentially predicts the probability of traffic accident occurrence with dashcam videos. Specifically, we propose to take advantage of graph convolution and recurrent networks for relational feature learning, and leverage Bayesian neural networks to address the intrinsic variability of latent relational representations. The derived uncertainty-based ranking loss is found to significantly boost model performance by improving the quality of relational features. In addition, we collect a new Car Crash Dataset (CCD) for traffic accident anticipation which contains environmental attributes and accident reasons annotations. Experimental results on both public and the newly-compiled datasets show state-of-the-art performance of our model. Our code and CCD dataset are available at https://github.com/Cogito2012/UString.Comment: Accepted by ACM MM 202
    corecore